Abstract

We present a simple theoretical framework, and corresponding practical
procedures, for comparing probabilistic models on real data in a
traditional machine learning setting. This framework is based on the theory
of proper scoring rules, but requires only basic algebra and probability
theory to understand and verify. The theoretical concepts presented are
well-studied, primarily in the statistics literature. The goal of this paper is
to advocate their wider adoption for performance evaluation in empirical
machine learning.


1 Why probabilistic predictions? 

When a model is applied to a situation where uncertainty is inherent (e.g.
predicting a biased coin flip, or a user’s next click), a probability distribution should
be its output. Accurate probability distributions provide more information than 
point predictions, and are the natural product of Bayesian models. Our goal 
is not to advocate probabilistic models per se, but to show in an accessible 
way that their output can be evaluated rigorously with no more difficulty than 
deterministic labelings in classification problems. 

2 Comparing models 

Where do observations come from? They are based on the state of the world. 
This state describes the situation in which a model is asked to make a prediction. 

$$\sigma \sim S \tag{1}$$

The support of this distribution S over states is likely infinite and uncountable. 
If we are predicting the weather then a state $\sigma$ includes a description of physical
phenomena that could affect future weather patterns. If we are predicting which 
ad a user will click on, a state includes factors influencing the user’s decision: 
personality, past history, web page design, and so on. The distribution is entirely 
theoretical, and need never be described formally. 
Based on the state of the world $\sigma$, an outcome is observed and recorded.
However, the outcome is not necessarily implied deterministically by $\sigma$. Rather,
there is a distribution over possible outcomes: 

$$x \sim f_\sigma \tag{2}$$

This includes the possibility of a degenerate distribution (probability 1 on a 
single outcome), but does not require it. Uncertainty could stem from true 
randomness (e.g. quantum noise) or from ignorance (e.g. the model does not 
know what the user ate for breakfast). The noise distribution $f_\sigma$ is again entirely
theoretical, and need never be described. 

Equations (1) and (2) define a generative framework for observations. When
scoring the probabilistic predictions of a model, we will typically have a single
observation from each of many different states of the world $\sigma_1, \ldots, \sigma_n$ (although
states drawn multiple times pose no problem). That is, we have a set X of n
observations:

$$X = \{x_{\sigma_1}, \ldots, x_{\sigma_n}\} \tag{3}$$

For convenience, we will assume that these observations are discrete, but a 
generalization to real-valued observations is possible. Corresponding to each of 
these observations are predictions from each of the models we are evaluating. 
For simplicity, we assume two models, g and k. 

$$G = \{g_{\sigma_1}, \ldots, g_{\sigma_n}\} \tag{4}$$

$$K = \{k_{\sigma_1}, \ldots, k_{\sigma_n}\} \tag{5}$$

Here $g_{\sigma_1}$ is the distribution that model g predicts in the state $\sigma_1$ where the
observed outcome is $x_{\sigma_1}$, and likewise for the remainder of the observations and
for model k. The theoretical assumption is that $x_{\sigma_i} \sim f_{\sigma_i}$, but the states $\sigma_i$
need no description for the purposes of model evaluation and we never need to
construct $f_{\sigma_i}$ explicitly. To say that model g is “better” than model k, we would
like to conclude that it has a lower divergence from the true distribution f in
expectation for some divergence function d:

$$\mathbb{E}_{\sigma \sim S}\left[d(f_\sigma \| g_\sigma)\right] < \mathbb{E}_{\sigma \sim S}\left[d(f_\sigma \| k_\sigma)\right] \tag{6}$$

Since we have a finite number n of samples, we can only determine probabilistically
if this inequality holds. Examples of d for which this estimation task is possible
using only X, G, and K are squared Euclidean distance $d(p \| q) = \|p - q\|^2$
and KL-divergence $d(p \| q) = \sum_j p_j \ln(p_j / q_j)$. It is not immediately obvious
how the truth of the inequality in (6) can be evaluated, even probabilistically,
without access to the true distributions $f_{\sigma_1}, \ldots, f_{\sigma_n}$. However, only simple
algebra is required. For KL-divergence, we first approximate the expectation of
the negative log probability assigned by model g (the derivation for k is identical)
to the true outcome x, that is:


$$\mathbb{E}_{\sigma \sim S}\left[\mathbb{E}_{x \sim f_\sigma}\left[-\ln(g_{\sigma,x})\right]\right] \tag{7}$$

This expression can be approximated from G and X:

$$\frac{1}{n} \sum_{i=1}^{n} -\ln(g_{\sigma_i, x_{\sigma_i}}) \tag{8}$$


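where $g_{\sigma_i, x_{\sigma_i}}$ is the probability that model g assigned to the true outcome
(corresponding in the theoretical model to state $\sigma_i$). The trick is that (7) is
equivalent to expected KL-divergence plus a constant:

$$\mathbb{E}_{\sigma \sim S}\Big[\sum_j f_{\sigma,j} \left(-\ln g_{\sigma,j}\right)\Big] = \mathbb{E}_{\sigma \sim S}\left[d_{KL}(f_\sigma \| g_\sigma) + H(f_\sigma)\right]$$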


Here $H(f) = -\sum_j f_j \ln f_j$ is the Shannon entropy of f. Since $H(f_\sigma)$ is
independent of a model’s predictions, differences in (7) between models g and k
must be due to differences in the expected KL-divergences $\mathbb{E}_{\sigma \sim S}[d_{KL}(f_\sigma \| g_\sigma)]$
and $\mathbb{E}_{\sigma \sim S}[d_{KL}(f_\sigma \| k_\sigma)]$. The only remaining complication is the finite sample:
how can we be sure that observed differences in (8) are due to differences in (7)?
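
The decomposition can be checked numerically. The following is a minimal sketch rather than part of the derivation: the distributions f, g, and k are made-up values for a single state, and the helper names are illustrative. It confirms that the expected negative log probability equals KL-divergence plus entropy, so the gap between two models' expected scores equals the gap between their KL-divergences.

```python
import math

def kl_divergence(f, g):
    """KL(f || g) in nats; outcomes with f_j == 0 contribute nothing."""
    return sum(fj * math.log(fj / gj) for fj, gj in zip(f, g) if fj > 0)

def entropy(f):
    """Shannon entropy H(f) in nats."""
    return -sum(fj * math.log(fj) for fj in f if fj > 0)

def expected_log_score(f, g):
    """E_{x ~ f}[-ln g_x]: expected negative log probability under the true f."""
    return sum(fj * -math.log(gj) for fj, gj in zip(f, g) if fj > 0)

# Hypothetical true distribution and two model predictions for one state.
f = [0.5, 0.3, 0.2]
g = [0.6, 0.2, 0.2]
k = [0.4, 0.4, 0.2]

# The identity holds for each model separately.
for model in (g, k):
    assert math.isclose(expected_log_score(f, model),
                        kl_divergence(f, model) + entropy(f))

# The entropy term cancels when two models are compared.
score_gap = expected_log_score(f, g) - expected_log_score(f, k)
kl_gap = kl_divergence(f, g) - kl_divergence(f, k)
print(score_gap, kl_gap)
```

In practice f is never available; the check only illustrates why differences in the observable score track differences in divergence.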

This is a standard statistical task: we have a set of n paired samples 
$(-\ln g_{\sigma_i, x_{\sigma_i}}, -\ln k_{\sigma_i, x_{\sigma_i}})$ related by the state of the world $\sigma_i$ for each sample,
and want to test whether the expectation of the g samples is significantly less 
than that of the k samples (meaning g is a better model). A paired t-test or the 
Wilcoxon signed-rank test (although it tests the median rather than the mean) 
are reasonable options. 
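
As an illustration of the paired test, the sketch below computes the paired t statistic from per-observation negative log scores using only the Python standard library. The probabilities are invented for the example and the function name is hypothetical. Converting the statistic to a p-value requires the t distribution; library routines such as `scipy.stats.ttest_rel` (or `scipy.stats.wilcoxon` for the signed-rank test) could be used for that step.

```python
import math
import statistics

def paired_t_statistic(scores_g, scores_k):
    """Return (t, degrees of freedom) for the paired differences g - k."""
    diffs = [a - b for a, b in zip(scores_g, scores_k)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std. dev. of differences
    return mean_diff / std_err, n - 1

# Hypothetical probabilities each model assigned to the observed outcome,
# one pair per held-out observation.
probs_g = [0.7, 0.6, 0.8, 0.5, 0.9, 0.4, 0.7, 0.6]
probs_k = [0.6, 0.5, 0.7, 0.5, 0.8, 0.5, 0.6, 0.4]

scores_g = [-math.log(p) for p in probs_g]
scores_k = [-math.log(p) for p in probs_k]

t_stat, dof = paired_t_statistic(scores_g, scores_k)
# A negative t means model g has the lower (better) mean score.
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```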

This simple algebraic trick comes out of the theory of proper scoring rules
(see Gneiting and Raftery [2007] for a thorough survey). Scoring rules were
developed to incentivize true reporting of probabilities by experts: first a report
is solicited in the form of a probability distribution q, then an outcome is
observed. The expert is paid based on their report and the outcome, according
to the scoring rule. A proper scoring rule incentivizes an expert to report
truthfully (which is not the case if the expert is paid e.g. $q_i$ for an outcome
i, often referred to as the naive scoring rule). Any proper scoring rule has an
associated divergence function, which for the logarithmic scoring rule ($\ln q_i$ for
an observed outcome i) is KL-divergence. The divergence function associated
with the quadratic scoring rule $2q_i - \|q\|^2$ is squared Euclidean distance, which
can also be derived with only simple algebra:


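$$\mathbb{E}_{\sigma \sim S}\left[\mathbb{E}_{x \sim f_\sigma}\left[-2 g_{\sigma,x} + \|g_\sigma\|^2\right]\right] \tag{9}$$

$$= \mathbb{E}_{\sigma \sim S}\Big[-2 \sum_j f_{\sigma,j} \, g_{\sigma,j} + \|g_\sigma\|^2 + \|f_\sigma\|^2 - \|f_\sigma\|^2\Big] \tag{10}$$

$$= \mathbb{E}_{\sigma \sim S}\left[\|f_\sigma - g_\sigma\|^2 - \|f_\sigma\|^2\right] \tag{11}$$
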


As with the logarithmic scoring rule, we get a divergence function $\|f_\sigma - g_\sigma\|^2$
and a generalized entropy term $\|f_\sigma\|^2$ which again is independent of the model’s
predictions. 
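
The quadratic case can be verified the same way. This is a small sketch with made-up distributions f and g for a single state (the function names are illustrative): the expected quadratic score under f equals the squared Euclidean distance minus the squared norm of f.

```python
import math

def sq_norm(p):
    """Squared Euclidean norm of a probability vector."""
    return sum(x * x for x in p)

def sq_distance(f, g):
    """Squared Euclidean distance between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(f, g))

def expected_quadratic_score(f, g):
    """E_{x ~ f}[-2 g_x + ||g||^2] for true distribution f and prediction g."""
    return sum(fj * (-2 * gj + sq_norm(g)) for fj, gj in zip(f, g))

# Hypothetical true distribution and model prediction for one state.
f = [0.5, 0.3, 0.2]
g = [0.6, 0.2, 0.2]

lhs = expected_quadratic_score(f, g)
rhs = sq_distance(f, g) - sq_norm(f)
assert math.isclose(lhs, rhs)
print(lhs, rhs)
```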


3 Procedure summary 

While theoretically justifying probabilistic model comparisons is slightly tedious, 
the procedure could not be simpler. To summarize: 

• For every held-out observation, score each model’s predicted distribution
q: $-\ln(q_i)$ for logarithmic, or $-2q_i + \|q\|^2$ for quadratic, given that outcome
i is observed

• Perform a (typically paired) statistical test to determine whether the 
scores for one model are significantly lower than those for the other, lower 
indicating a better model 
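
The two steps above can be sketched as follows. This is a minimal illustration, assuming each prediction is a list of probabilities over discrete outcomes and each observation is an outcome index; the data and all function names are hypothetical.

```python
import math

def log_score(q, i):
    """Logarithmic score -ln(q_i); infinite if the outcome was given zero probability."""
    return -math.log(q[i]) if q[i] > 0 else math.inf

def quadratic_score(q, i):
    """Quadratic score -2 q_i + ||q||^2."""
    return -2 * q[i] + sum(p * p for p in q)

def per_observation_scores(score_fn, predictions, outcomes):
    """Score every held-out observation; these are the inputs to the paired test."""
    return [score_fn(q, i) for q, i in zip(predictions, outcomes)]

def mean_score(score_fn, predictions, outcomes):
    """Mean score of a model; lower is better."""
    scores = per_observation_scores(score_fn, predictions, outcomes)
    return sum(scores) / len(scores)

# Hypothetical held-out outcomes and the two models' predicted distributions.
outcomes = [0, 2, 1, 0]
preds_g = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2], [0.5, 0.3, 0.2]]
preds_k = [[0.5, 0.3, 0.2], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.4, 0.4, 0.2]]

for name, fn in (("logarithmic", log_score), ("quadratic", quadratic_score)):
    print(name, mean_score(fn, preds_g, outcomes), mean_score(fn, preds_k, outcomes))
```

The per-observation score lists for the two models are paired by observation, which is what makes a paired test applicable.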

When comparing more than two models, perform as many pairwise tests as 
necessary. The “figure of merit” for a model is its mean score (e.g. (8) for the
logarithmic scoring rule), lower implying less divergence from the unobserved 
true distributions of observations and therefore a better model (modulo noise 
in the estimate). 

4 Choosing a divergence function 

We have presented two of the most common choices for scoring rules, quadratic 
and logarithmic, corresponding to evaluation with squared Euclidean distance 
and KL-divergence respectively. The former is of interest when KL-divergence is
too quick to dismiss models which put zero probability on observed outcomes
(a single such outcome yields an infinite logarithmic score).
While these models are clearly “wrong,” i.e. provably not reporting the true 
distribution of observations, this is usually not our main concern (“all models 
are wrong, some are useful”). Often we want to compare against model-free 
baselines (e.g. observed frequencies) which do report zero probability on observed
outcomes, and adding parameters to hedge their reports is undesirable.
When this is not an issue, KL-divergence is desirable due to its popularity and 
connections to information theory. 
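
To make the zero-probability issue concrete, here is a small sketch with invented data: a frequency baseline built from training outcomes assigns zero probability to an outcome it never saw, so its logarithmic score on that outcome is infinite while its quadratic score stays finite. The variable and function names are illustrative.

```python
import math
from collections import Counter

def log_score(q, i):
    """Logarithmic score -ln(q_i); infinite when q_i is zero."""
    return -math.log(q[i]) if q[i] > 0 else math.inf

def quadratic_score(q, i):
    """Quadratic score -2 q_i + ||q||^2; always finite."""
    return -2 * q[i] + sum(p * p for p in q)

# Hypothetical training outcomes over three possible outcomes; outcome 2 never occurs.
training_outcomes = [0, 0, 1, 0, 1]
num_outcomes = 3
counts = Counter(training_outcomes)
baseline = [counts[j] / len(training_outcomes) for j in range(num_outcomes)]

held_out_outcome = 2  # an outcome the baseline gave zero probability
baseline_log = log_score(baseline, held_out_outcome)
baseline_quad = quadratic_score(baseline, held_out_outcome)
print(baseline, baseline_log, baseline_quad)
```

A single infinite score makes the baseline's mean logarithmic score infinite regardless of how it performs elsewhere, whereas the quadratic score still allows a graded comparison.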

Other proper scoring rules exist, and their divergence functions can be used 
in model comparisons just as for the logarithmic and quadratic scoring rules. 
For example, cosine similarity corresponds to the spherical scoring rule. In
general, any Bregman divergence is a feasible way of comparing probabilistic 
models (i.e. has an associated proper scoring rule). 
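
Propriety can be illustrated with a coarse grid search. The sketch below is an illustration under stated assumptions: a binary outcome with a made-up true probability of 0.7, reports restricted to whole percentages, and scoring rules written as payments (higher is better), with the spherical rule taken in its usual form $q_i / \|q\|$. The expected payment is maximized by the truthful report for the logarithmic, quadratic, and spherical rules, but not for the naive rule.

```python
import math

def naive(q, i):
    return q[i]

def logarithmic(q, i):
    return math.log(q[i])

def quadratic(q, i):
    return 2 * q[i] - sum(x * x for x in q)

def spherical(q, i):
    return q[i] / math.sqrt(sum(x * x for x in q))

def expected_payment(rule, p, r):
    """Expected payment for reporting (r, 1 - r) when the truth is (p, 1 - p)."""
    truth, report = (p, 1 - p), (r, 1 - r)
    return sum(t * rule(report, i) for i, t in enumerate(truth))

def best_report_percent(rule, p):
    """Report, in whole percent, that maximizes expected payment on a 1..99 grid."""
    return max(range(1, 100), key=lambda pct: expected_payment(rule, p, pct / 100))

p_true = 0.7  # hypothetical true probability of the first outcome
best = {rule.__name__: best_report_percent(rule, p_true)
        for rule in (naive, logarithmic, quadratic, spherical)}
print(best)
```

Under the naive rule the expected payment is linear in the report, so the expert does best by pushing the report toward certainty on the more likely outcome.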

5 Alternatives 

One popular method of scoring probabilistic models is perplexity, which is simply
an exponentiated version of (8). This exponentiation rewards slight over-reporting
of high probability events, but the effect diminishes rapidly with increasing
dataset size. Nonetheless, it is theoretically preferable to use the
unexponentiated version for model comparisons.
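
For reference, the relationship between perplexity and the mean logarithmic score can be written in a few lines. This is a sketch with invented probabilities (the function names are illustrative), using natural logarithms throughout.

```python
import math

def mean_log_score(probs_on_outcome):
    """Mean of -ln(probability assigned to the observed outcome)."""
    return sum(-math.log(p) for p in probs_on_outcome) / len(probs_on_outcome)

def perplexity(probs_on_outcome):
    """Perplexity: the exponentiated mean logarithmic score."""
    return math.exp(mean_log_score(probs_on_outcome))

# Hypothetical probabilities two models assigned to the observed outcomes.
probs_g = [0.5, 0.25, 0.5, 0.25]
probs_k = [0.25, 0.25, 0.5, 0.25]

print(mean_log_score(probs_g), perplexity(probs_g))
print(mean_log_score(probs_k), perplexity(probs_k))
```

On a fixed dataset the two quantities rank models identically, since the exponential is monotonic; the difference lies in how the finite-sample estimates behave.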

There are many popular ways of scoring non-probabilistic predictions based 
on classification accuracy, precision and recall, and so on. These methods can 
be applied to probabilistic models, for example by ranking outcomes by their 
reported probability. However, such procedures discard much of the information
probabilistic predictions provide, and so are generally less desirable when
choosing between probabilistic models. 
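
A small example shows what is lost. In the sketch below (invented data, illustrative function names), two models always pick the same most likely outcome and so have identical classification accuracy, yet their mean logarithmic scores differ because one is confident when right and cautious when wrong, while the other is the reverse.

```python
import math

def top1_accuracy(predictions, outcomes):
    """Fraction of observations where the most probable outcome was observed."""
    hits = sum(max(range(len(q)), key=q.__getitem__) == i
               for q, i in zip(predictions, outcomes))
    return hits / len(outcomes)

def mean_log_score(predictions, outcomes):
    """Mean of -ln(probability assigned to the observed outcome); lower is better."""
    return sum(-math.log(q[i]) for q, i in zip(predictions, outcomes)) / len(outcomes)

# Hypothetical binary outcomes and predictions; both models err on the last one.
outcomes = [0, 1, 0, 1]
preds_g = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.6, 0.4]]
preds_k = [[0.6, 0.4], [0.4, 0.6], [0.55, 0.45], [0.9, 0.1]]

print(top1_accuracy(preds_g, outcomes), top1_accuracy(preds_k, outcomes))
print(mean_log_score(preds_g, outcomes), mean_log_score(preds_k, outcomes))
```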